PicAChoo: A Text Analysis Tool for Customizable Feature Selection with Dynamic Composition of Primitive Methods

Authors

  • Jaeseok Myung
  • Jung-Yeon Yang
  • Sang-goo Lee
Abstract

Although documents contain hundreds of thousands of unique words, only a small number of them are significantly useful for text analysis. Feature selection has therefore become an important issue in many text analysis studies. A number of techniques and algorithms for feature selection are available, but it is hard to say that any one algorithm outperforms the others, because feature selection results depend mostly on the source documents. We must therefore pick and choose the appropriate algorithm and the best subset of feature words each time we analyze source documents. In this paper, we present a framework named ‘PicAChoo’ (short for ‘Pick And Choose’) that enables customizable feature selection environments by composing several primitive feature selection methods without hard-coding. As the name indicates, the framework provides many strategies for extracting appropriate features and allows dynamic composition of several feature selection methods. In addition, it aims to give users an environment that exploits linguistic characteristics of textual data, such as part-of-speech and sentence structure. Finally, we illustrate that the selected feature words can be used for various intelligent services.
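
The idea of composing primitive feature selection methods without hard-coding can be illustrated with a minimal Python sketch. This is not PicAChoo's actual API, which the abstract does not describe; the names `drop_stopwords`, `min_frequency`, `top_k`, and `compose` are hypothetical and chosen only for illustration. It assumes each primitive is a function from a token list to a filtered token list, so that pipelines can be assembled at run time.

```python
from collections import Counter
from typing import Callable, Iterable, List

# Assumption: a "primitive" is any function that maps a token list
# to a filtered token list. All names below are hypothetical.
Primitive = Callable[[List[str]], List[str]]


def drop_stopwords(stopwords: Iterable[str]) -> Primitive:
    """Primitive that removes tokens found in a stop-word list."""
    stop = {w.lower() for w in stopwords}

    def apply(tokens: List[str]) -> List[str]:
        return [t for t in tokens if t.lower() not in stop]

    return apply


def min_frequency(threshold: int) -> Primitive:
    """Primitive that keeps tokens occurring at least `threshold` times."""

    def apply(tokens: List[str]) -> List[str]:
        counts = Counter(tokens)
        return [t for t in tokens if counts[t] >= threshold]

    return apply


def top_k(k: int) -> Primitive:
    """Primitive that keeps only the k most frequent distinct tokens."""

    def apply(tokens: List[str]) -> List[str]:
        keep = {w for w, _ in Counter(tokens).most_common(k)}
        return [t for t in tokens if t in keep]

    return apply


def compose(*steps: Primitive) -> Primitive:
    """Chain primitives left to right into a single selector."""

    def pipeline(tokens: List[str]) -> List[str]:
        for step in steps:
            tokens = step(tokens)
        return tokens

    return pipeline


# The pipeline is assembled from data at run time, not hard-coded.
tokens = "the cat sat on the mat and the cat ran".split()
select = compose(drop_stopwords(["the", "on", "and"]), min_frequency(2))
features = sorted(set(select(tokens)))
print(features)
```

Because every primitive shares one signature, a different combination (for example, `compose(drop_stopwords(...), top_k(50))`) can be built for each document collection without changing the primitives themselves.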

Similar resources

A Novel One Sided Feature Selection Method for Imbalanced Text Classification

Imbalanced data can be seen in various areas such as text classification, credit card fraud detection, risk management, web page classification, image classification, medical diagnosis/monitoring, and biological data analysis. Classification algorithms tend to favor the large class and might even treat the minority class data as outliers. The text data is one of t...

An Improved K-Nearest Neighbor with Crow Search Algorithm for Feature Selection in Text Documents Classification

The Internet provides easy access to a variety of library resources. However, classifying documents within a large amount of data is still an issue, and finding specific documents demands time and energy. Classifying similar documents into specific classes can reduce the time needed to search for the required data, particularly text documents. This is further facilitated by using Artificial...

An Improved Flower Pollination Algorithm with AdaBoost Algorithm for Feature Selection in Text Documents Classification

In recent years, production of text documents has seen an exponential growth, which is the reason why their proper classification seems necessary for better access. One of the main problems of classifying text documents is working in high-dimensional feature space. Feature Selection (FS) is one of the ways to reduce the number of text attributes. So, working with a great bulk of the feature spa...

Journal:
  • JSW

Volume 5, Issue

Pages -

Publication date: 2010